Categorical Data in R

knitr::opts_chunk$set(message = FALSE, echo = FALSE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(vcd)
## Loading required package: grid
library(vcdExtra)
## Loading required package: gnm
## 
## Attaching package: 'vcdExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     summarise
library(sjPlot)
library(ggmosaic)
## 
## Attaching package: 'ggmosaic'
## 
## The following objects are masked from 'package:vcd':
## 
##     mosaic, spine
library(ggpubr)

#install.packages("openintro")
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata

1 Introduction

Let us play with some categorical (or predominantly categorical) datasets in R and see how we can analyze and plot them.

1.1 The titanic dataset

1.1.1 titanic Bar Plots
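
As a starting point, here is a minimal sketch of a grouped bar plot of survival by passenger class. It assumes base R's built-in `Titanic` contingency table as a stand-in for the `titanic` object used in this document, whose columns are not shown here; the names `titanic_counts`, `class_survival`, and `titanic_bar` are illustrative only.

```r
library(ggplot2)

# Assumption: base R's `Titanic` table (datasets package) stands in for `titanic`.
# Converting the 4-way table gives columns Class, Sex, Age, Survived, Freq.
titanic_counts <- as.data.frame(Titanic)

# Collapse over Sex and Age so there is one count per Class x Survived cell
class_survival <- aggregate(Freq ~ Class + Survived, data = titanic_counts, FUN = sum)

# Side-by-side bars: one bar per survival outcome within each class
titanic_bar <- ggplot(class_survival, aes(x = Class, y = Freq, fill = Survived)) +
  geom_col(position = "dodge") +
  labs(title = "Titanic passengers by class and survival",
       x = "Class", y = "Number of people")

titanic_bar
```

Switching `position = "dodge"` to `position = "fill"` shows proportions within each class instead of raw counts, which can make survival rates easier to compare across classes of different sizes.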

1.1.2 titanic Mosaic Plot
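
One possible approach is sketched below with `vcd::mosaic()`. It again assumes base R's `Titanic` table as a stand-in for `titanic`; `class_by_survival` is an illustrative name. The function is called with the `vcd::` prefix because, as the startup messages above show, loading `ggmosaic` masks `mosaic` from `vcd`.

```r
library(vcd)

# Assumption: base R's `Titanic` table stands in for the `titanic` data.
# Keep only the Class (dimension 1) and Survived (dimension 4) margins.
class_by_survival <- margin.table(Titanic, margin = c(1, 4))

# Tile areas are proportional to cell counts; shading reflects
# Pearson residuals from the independence model
vcd::mosaic(class_by_survival, shade = TRUE, legend = TRUE)
```

Tiles shaded in stronger colours mark cells whose counts depart most from what independence between class and survival would predict.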

1.1.3 titanic Balloon Plot
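
A balloon plot can be sketched with `ggballoonplot()` from `ggpubr`, which draws one circle per cell of a contingency table with size mapped to the count. As before, base R's `Titanic` table is an assumed stand-in for `titanic`, and the object names are illustrative.

```r
library(ggplot2)
library(ggpubr)

# Assumption: base R's `Titanic` table stands in for the `titanic` data.
# One row per Class x Sex x Age x Survived cell, with its count in Freq.
titanic_counts <- as.data.frame(Titanic)

# Circle size and fill both encode the cell count; panels split by survival
titanic_balloon <- ggballoonplot(
  titanic_counts,
  x = "Class", y = "Sex",
  size = "Freq", fill = "Freq",
  facet.by = c("Survived", "Age"),
  ggtheme = theme_bw()
)

titanic_balloon
```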

1.2 The hippocorpus dataset from Kaggle

This is a dataset from Kaggle and is based on Reference 2.

Hippocorpus is a dataset of 6854 English diary-like short stories about recalled and imagined events. Using a crowdsourcing framework, the owners of this dataset collected recalled stories and summaries from workers, then provided these summaries to other workers, who wrote imagined stories. Months later, the dataset creators collected a retold version of the recalled stories from a subset of the recalled authors. The dataset contains author demographics (age, gender, race), their openness to experience, as well as some variables regarding the author’s relationship to the event (e.g., how personal the event is, how often they tell its story, etc.)

Apart from metadata pertaining to each respondent, there are 4 Likert Scale variables:

  • distracted: How distracted were you while writing your story? (5-point Likert)
  • draining: How taxing/draining was writing for you emotionally? (5-point Likert)
  • frequency: How often do you think about or talk about this event? (5-point Likert)
  • importance: How impactful, important, or personal is this story/this event to you? (5-point Likert)

Plot these four variables using the package sjPlot. Can you also try a ggplot?
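
For the ggplot route, one common pattern is a 100% stacked bar per item. The sketch below uses randomly generated responses as a stand-in, since the actual hippocorpus file is not available here; `likert_items`, `likert_long`, and `likert_plot` are hypothetical names, and the simulated values say nothing about the real data.

```r
library(ggplot2)
library(tidyr)

set.seed(42)  # reproducible stand-in data
n <- 200

# Hypothetical stand-in for the four 5-point Likert columns;
# replace with the corresponding columns of the real dataset
likert_items <- data.frame(
  distracted = sample(1:5, n, replace = TRUE),
  draining   = sample(1:5, n, replace = TRUE),
  frequency  = sample(1:5, n, replace = TRUE),
  importance = sample(1:5, n, replace = TRUE)
)

# Long format: one row per (item, response) pair
likert_long <- pivot_longer(likert_items, cols = everything(),
                            names_to = "item", values_to = "response")

# Treat the scores as ordered categories rather than numbers
likert_long$response <- factor(likert_long$response, levels = 1:5)

# One horizontal bar per item, split by the share of each response level
likert_plot <- ggplot(likert_long, aes(y = item, fill = response)) +
  geom_bar(position = "fill") +
  scale_x_continuous(labels = scales::percent) +
  labs(x = "Share of responses", y = NULL, fill = "Response (1-5)")

likert_plot
```

For the sjPlot version, `plot_likert()` takes a data frame of Likert items such as `likert_items`; check its help page for how it expects the neutral middle category of a 5-point scale to be specified.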

1.3 A dataset from the vcdExtra package

Pick one of the fairly large Categorical datasets that are built into vcdExtra: type data(package = "vcdExtra") in your Console.

Create:

  • A Contingency Table
  • A Bar Plot
  • A Mosaic Plot
  • A Balloon Plot
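
To illustrate the workflow, here is a sketch of all four outputs for one dataset. The choice of `Mental` (mental health status by parental socioeconomic status, stored in frequency form with columns `ses`, `mental`, and `Freq`) is only an assumption made for the example; any other dataset in the package can be substituted, though table-form datasets would skip the `xtabs()` step.

```r
library(vcd)       # mosaic()
library(vcdExtra)  # provides the Mental dataset
library(ggplot2)
library(ggpubr)    # ggballoonplot()

# Assumption: Mental is used purely as an example dataset
data("Mental", package = "vcdExtra")

# 1. Contingency table: cross-tabulate the counts stored in Freq
mental_tab <- xtabs(Freq ~ ses + mental, data = Mental)
mental_tab

# 2. Bar plot: share of each mental health category within each SES level
mental_bar <- ggplot(Mental, aes(x = ses, y = Freq, fill = mental)) +
  geom_col(position = "fill") +
  labs(x = "Parents' SES", y = "Proportion", fill = "Mental health")
mental_bar

# 3. Mosaic plot: vcd:: prefix avoids the mosaic() masked by ggmosaic
vcd::mosaic(mental_tab, shade = TRUE, legend = TRUE)

# 4. Balloon plot: circle size and fill both encode the cell count
mental_balloon <- ggballoonplot(Mental, x = "ses", y = "mental",
                                size = "Freq", fill = "Freq")
mental_balloon
```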

2 Conclusion

Write a few comments on the data and visualizations. Did they convey a story of sorts?

3 References

  1. A detailed analysis of the NHANES dataset, https://awagaman.people.amherst.edu/stat230/Stat230CodeCompilationExampleCodeUsingNHANES.pdf

  2. Maarten Sap, Eric Horvitz, Yejin Choi, Noah A. Smith, and James Pennebaker (2020) Recollection versus Imagination: Exploring Human Memory and Cognition via Neural Language Models. ACL.
